4/4/2024
In a simple linear model, we model a response \(Y\) as a function of some inputs \(X\) (our independent variables), along with an error term \(\epsilon\).
\[Y = \beta_0 + \beta_{1}X_1 + ... + \beta_{p}X_p + \epsilon\]
\(Y\) : our response variable
\(\beta_0\) : the intercept term
\(\beta_p\) : the coefficient associated with independent variable \(X_p\)
\(\epsilon\) : error (residual) term
Linearity: can the relationship between \(X\) and \(Y\) be approximated by a straight line?
Independence: are the errors (residuals) independent of each other?
Homoscedasticity: is the variance of the errors (residuals) constant as a function of \(X\)?
Normality: are the errors (residuals) normally distributed?
library(ggplot2)
library(cowplot)
set.seed(12345)
# number of observations
n <- 100
# generate x values from 1 to n
x <- 1:n
# generate a heteroscedastic error term: normally distributed with
# mean 0 and a standard deviation that grows with x
sd <- runif(n, min = 0, max = 4)
error <- rnorm(n, 0, sd * x)
# generate y values
y <- (x * 10) + error
# put x and y into dataframe
df = data.frame(x, y)
# plot
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
theme_cowplot() +
labs(x = "X", y = "Y")

library(HistData)
data(GaltonFamilies)
# subset to male children
galton_filt = subset(GaltonFamilies, gender=="male")
ggplot(galton_filt, aes(x=midparentHeight, y=childHeight)) +
geom_point(alpha=0.5, size=5) +
geom_smooth(method="lm", se=FALSE) +
theme_cowplot() +
labs(x="Mid-parent height", y="Child height")

m = lm(childHeight ~ midparentHeight, data=galton_filt)
galton_filt$pred = predict(m)
ggplot(galton_filt, aes(x=midparentHeight, y=childHeight)) +
geom_segment(aes(xend = midparentHeight, yend = pred), color="firebrick") +
geom_point(alpha=0.5, size=5) +
geom_smooth(method="lm", se=FALSE) +
theme_cowplot() +
labs(x="Mid-parent height", y="Child height")

Linearity: can the relationship between \(X\) and \(Y\) be approximated by a straight line?
Independence: are the \(Y\) observations independent of each other?
Homoscedasticity: is the variance of the errors (residuals) constant as a function of \(X\)?
Normality: are the errors (residuals) normally distributed?
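These assumptions are usually checked by inspecting the residuals. A minimal base-R sketch, applied to the simulated data from earlier (stored in `df`), whose error variance grows with \(X\) by construction:

```r
# fit a simple linear model to the simulated data
m_sim <- lm(y ~ x, data = df)

# residuals vs. fitted values: a "funnel" shape suggests heteroscedasticity
plot(fitted(m_sim), resid(m_sim),
     xlab = "Fitted values", ylab = "Residuals")

# normal QQ plot of the residuals: points far from the line suggest non-normality
qqnorm(resid(m_sim))
qqline(resid(m_sim))
```

For the heteroscedastic simulation above, the residuals-vs-fitted plot should fan out as the fitted values grow.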
What if these assumptions are violated?
In biology, we’re often interested in modeling data that don’t conform to the assumptions of simple linear regression.
Count data
Number of DNA sequencing reads aligned to a particular genomic position
Binary data
Survival after treatment with a new therapeutic compound
Discrete categorical data
Cell type identity in a large single-cell RNA-seq experiment
To model these data types, we can turn to generalized linear models.
\[\textcolor{teal}{\mu} = \mathbb{E}(Y | X)\]
\[\textcolor{violet}{\eta} = \beta_0 + \beta_{1}X_1 + ... + \beta_{p}X_{p}\]
\[\textcolor{olive}{g}(\textcolor{teal}{\mu}) = \textcolor{violet}{\eta}\] \[\textcolor{teal}{\mu} = \textcolor{olive}{g}^{-1}(\textcolor{violet}{\eta})\]
Whether a passenger on the Titanic survived given the amount they paid for their fare.
The “canonical” link \(g(\mu)\) for a logistic regression is \(\log\left(\frac{\mu}{1 - \mu}\right)\).
If we define our linear predictor as \(\eta\), we can do \(g^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}\) to ensure that our predictions match the expected distribution.
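We can check this pair of transformations numerically; base R's `plogis()` implements the inverse logit, so it should agree with \(g^{-1}\) written out by hand:

```r
# linear predictor values (any real number)
eta <- c(-3, 0, 2.5)

# inverse link: maps eta onto (0, 1), matching a probability
mu <- exp(eta) / (1 + exp(eta))

# base R's plogis() computes the same inverse-logit transform
stopifnot(all.equal(mu, plogis(eta)))

# applying the link g recovers eta: log(mu / (1 - mu))
stopifnot(all.equal(log(mu / (1 - mu)), eta))
```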
Assuming you already have ingredients #1 (an expected distribution for your \(Y\) values, denoted \(\mu\)) and #2 (a linear predictor, denoted \(\eta\)) there is a “canonical” link function for that distribution.
| GLM type | canonical link function | inverse of link function |
|---|---|---|
| Simple linear regression | \(g(\mu) = \mu\) | \(\mu = \eta\) |
| Poisson regression | \(g(\mu) = \log(\mu)\) | \(\mu = e^{\eta}\) |
| Logistic regression | \(g(\mu) = \log(\frac{\mu}{1 - \mu})\) | \(\mu = \frac{e^{\eta}}{1 + e^{\eta}}\) |
Note: You can pick any link function you like, even if it’s not the canonical or natural link — it just depends on how you want to model \(Y | X\)!
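In R, each family object carries its link and inverse link as functions, so the table above can be checked directly (a quick sketch using the base R `gaussian()`, `poisson()`, and `binomial()` family constructors):

```r
eta <- 1.5

# simple linear regression: identity link, mu = eta
stopifnot(gaussian()$linkinv(eta) == eta)

# Poisson regression: log link, mu = exp(eta)
stopifnot(all.equal(poisson()$linkinv(eta), exp(eta)))

# logistic regression: logit link, mu = exp(eta) / (1 + exp(eta))
stopifnot(all.equal(binomial()$linkinv(eta), exp(eta) / (1 + exp(eta))))
```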
We’ve established that we need three things in order to fit a GLM:
An expected distribution for the \(Y\) values (\(\mu\))
A linear predictor (\(\eta\))
A “link” between our linear predictor and the response variable (\(g\))
When we fit a GLM in R, what does this involve?
New mutations occur in germline (sperm or egg) cells and can be passed on to children.
In humans, we typically identify de novo mutations by sequencing trios (two parents and their child).
Figure 1
library(dplyr)
# read in DNM data
dnms = read.csv("https://raw.githubusercontent.com/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.txt", sep="\t")
# remove unphased DNMs
dnms = subset(dnms, phase != "na")
# group DNMs by sample id and phase
by_phase = dnms %>% count(new_sample_id, phase)
# group by sample alone to get totals
by_sample = dnms %>% count(new_sample_id)
by_sample$phase = "total"
merged = rbind(by_phase, by_sample)
ggplot(merged, aes(x=phase, y=n, col=phase)) +
geom_boxplot(fill="white") +
geom_jitter(size=2, width=0.15, alpha=0.5) +
theme_cowplot() +
labs(x="Parent of origin", y="Count")

De novo mutation counts in Utah families. Data from Sasani et al. (2019) eLife.
We also expect de novo mutation counts to depend on parental age.
Each time a spermatogonial stem cell undergoes mitotic DNA replication, there’s an opportunity for mutations to occur.
DNA damage also accumulates in germ cells over time, leading to new mutations.
Let’s build a GLM to find out.
library(ggplot2)
library(dplyr)
library(cowplot)
dnms = read.csv("https://raw.githubusercontent.com/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.summary.csv")
ggplot(dnms, aes(x=dad_age, y=snv_dnms)) +
geom_point(size=5) +
labs(x="Paternal age", y="Number of DNMs in child") +
theme_cowplot()

De novo mutation counts observed in children as a function of paternal age at birth. Data from Sasani et al. (2019) eLife.
Remember the first ingredient we need to fit a GLM:
An expected distribution for the \(Y\) values (\(\mu\))
Our model needs to appropriately capture the mean and variance of our \(Y\) values. Since we’re dealing with count data, the Poisson distribution makes sense.
The Poisson distribution tells us:
\[P(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!} \ \text{for} \ k = 0,1,2...\]
where \(\lambda\) is the expected number of “events” (the mean we want to model in our GLM).
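We can verify this formula against R’s built-in `dpois()` (a quick numeric check, with \(\lambda = 4\) chosen arbitrarily):

```r
lambda <- 4
k <- 0:10

# the Poisson probability mass function, written out by hand
p_manual <- exp(-lambda) * lambda^k / factorial(k)

# R's built-in dpois() should agree
stopifnot(all.equal(p_manual, dpois(k, lambda)))
```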
Now we need the remaining ingredients for a GLM:
A “link” \(g\) between our linear predictor \(\eta\) and the conditional mean \(\mu\) of the response variable.
In Poisson regression, the canonical link function \(g(\mu)\) is the “log link”:
\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}\]
The log link makes sense for count data because it ensures that our predictions are always greater than 0.
Since the log link is the default for Poisson GLMs, specifying the link explicitly is optional.
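The chunk that produced the summary below isn’t shown in these notes; judging from its Call line, it would look like the following (the explicit-link and default forms fit the same model):

```r
# fit a Poisson GLM of DNM counts on paternal age, writing the link explicitly
m = glm(snv_dnms ~ dad_age, family = poisson(link = "log"), data = dnms)

# the log link is the default, so this is equivalent
m = glm(snv_dnms ~ dad_age, family = "poisson", data = dnms)

summary(m)
```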
Let’s look at the model summary:
Call:
glm(formula = snv_dnms ~ dad_age, family = "poisson", data = dnms)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.26034 -0.88413 0.02955 0.72784 3.04281
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.355454 0.083697 40.09 <2e-16 ***
dad_age 0.025808 0.002751 9.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 178.226 on 69 degrees of freedom
Residual deviance: 92.624 on 68 degrees of freedom
AIC: 512.22
Number of Fisher Scoring iterations: 4
One key difference from simple linear regression is that the coefficients are on a different scale.
Recall that our Poisson GLM is:
\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}\]
This is the same as saying:
\[\mu = e^{\beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}}\]
(applying the inverse of our link function to the linear predictor)
In our summary() output, let’s look at the coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.35545435 0.083696926 40.090532 0.000000e+00
dad_age 0.02580802 0.002751444 9.379809 6.609266e-21
For a unit increase in dad_age, there is a multiplicative effect of \(e^{0.02581} = 1.0261\) on the response variable (mean # of DNMs).
In other words, the expected number of DNMs in a child born to a 25-year-old father is \(1.0261\) times the expected number in a child born to a 24-year-old father.
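A short sketch of how to pull this multiplicative effect (and a confidence interval on the same scale) out of the fitted model, assuming the Poisson GLM is stored in an object `m`:

```r
# exponentiate the coefficient to get the multiplicative effect per year
# of paternal age; from the summary above, exp(0.02580802) is about 1.0261
exp(coef(m)["dad_age"])

# exponentiating Wald confidence limits gives a CI on the multiplicative scale
exp(confint.default(m, "dad_age"))
```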
We only looked for mutations at genomic positions where at least 10 sequencing reads were aligned in mom, dad, and the child.
library(cowplot)
# convert callable base pairs to billions for plotting
dnms$callable_bp_bill = dnms$autosomal_callable_fraction / 1e9
ggplot(dnms, aes(x=factor(sample_id), y=callable_bp_bill)) +
geom_col() +
labs(x = "Sample ID", y="Callable bp (billions)") +
coord_cartesian(ylim = c(2.45, 2.6)) +
theme_cowplot() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), text = element_text(size = 24), axis.text.y = element_text(size=24))

We were able to look for DNMs at ~2.6 billion positions in most children, but closer to ~2.4 billion in others.
Does this variable sequencing depth affect our observed DNM counts?
Our current model is:
\[\log(\mu) = \beta_0 + \beta_{1}X_{1}\]
where \(\mu\) is the expected number of DNMs observed in a child given \(X_1\), the age of the child’s father.
We actually want to model: \[\log(\frac{\mu}{N}) = \beta_0 + \beta_{1}X_{1}\]
where \(N\) is the number of genomic positions at which we had enough sequencing data in the trio to detect a mutation.
In other words, we want to model the expected number of DNMs in a child per “callable” base pair as a function of age.
Thanks to logarithm rules (remember those?),
\[\log(\frac{\mu}{N}) = \beta_0 + \beta_{1}X_{1}\]
\[\log(\mu) - \log(N) = \beta_0 + \beta_{1}X_{1}\]
\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + \log(N)\]
In the context of Poisson GLMs, the \(\log(N)\) term is sometimes referred to as “exposure” or “offset.”
In R, we can easily update our model to account for an “offset.”
Recall that our new model is
\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + \log(N)\]
where \(N\) is the number of callable base pairs per trio.
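The fitting chunk isn’t shown; matching the Call line in the summary below, it would use `offset()` inside the model formula:

```r
# refit the Poisson GLM with log(callable bp) as an offset, so we model
# the mutation rate per callable base pair rather than the raw count
m_offset = glm(all_dnms ~ dad_age + offset(log(autosomal_callable_fraction)),
               family = "poisson", data = dnms)
summary(m_offset)
```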
Call:
glm(formula = all_dnms ~ dad_age + offset(log(autosomal_callable_fraction)),
family = "poisson", data = dnms)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2407 -0.7998 0.0082 0.7726 2.9893
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.193428 0.080329 -226.487 <2e-16 ***
dad_age 0.024550 0.002644 9.287 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 179.845 on 69 degrees of freedom
Residual deviance: 95.811 on 68 degrees of freedom
AIC: 521.3
Number of Fisher Scoring iterations: 4
The dad_age term is still significant, and the coefficient is very similar to its estimate in our “vanilla” Poisson model.
When fitting GLMs, you need to answer:
What distribution are my \(Y\) values expected to follow?
This will determine the family you choose in the glm function
How do I make the expected values of \(Y\) compatible with my linear predictor?
This will determine the link you choose
Be careful to interpret coefficients accordingly!
R methods for various types of data